Abstract

Identifying moving objects in a video sequence produced by a static camera is a fundamental and critical task in many computer-vision applications. A common approach performs background subtraction, which identifies moving objects as the portion of a video frame that differs significantly from a background model. A good background subtraction algorithm has to be robust to changes in illumination, and it should avoid detecting non-stationary background objects such as moving leaves, rain, snow, and shadows. In addition, the internal background model should respond quickly to changes in the background, such as objects that start or stop moving.

We present a new algorithm for video segmentation that processes the input video sequence as a 3D matrix whose third axis is the time domain. Our approach identifies the background by reducing the input dimension using the diffusion bases methodology. Furthermore, we describe an iterative method for extracting and deleting the background. The algorithm has two versions that together cover both types of background: one for scenes with static backgrounds and one for scenes with dynamic (moving) backgrounds.
1 Introduction



Video surveillance systems, tracking systems, statistical packages that count people, games, etc. seek to automatically identify people, objects, or events of interest in different types of environments. Typically, these systems consist of stationary cameras that are directed at offices, parking lots, playgrounds, fences, and so on, together with computer systems that process the video frames. Human operators or other processing elements are notified about salient events. There are many needs for automated surveillance systems in commercial, law enforcement, and military applications. In addition to the obvious security applications, video surveillance technology has been proposed to measure traffic flow, detect accidents on highways, monitor pedestrian congestion in public spaces, compile consumer demographics in shopping malls and amusement parks, log routine maintenance tasks at nuclear facilities, and count endangered species. The numerous military applications include patrolling national borders, measuring the flow of refugees in troubled areas, monitoring peace treaties, and providing secure perimeters around bases.

Subtracting backgrounds captured by static cameras can also be useful for achieving low-bit-rate video compression when transmitting rich multimedia content: the subtracted background is transmitted once, followed by the detected, segmented objects.

A common element in surveillance systems is a module that performs background subtraction to distinguish between background pixels, which should be ignored, and foreground pixels, which should be processed for identification or tracking. The difficulty in background subtraction lies not in the differentiation itself but in maintaining the background model, its representation, and its associated statistics, in particular when capturing the background in frames where the background can change over time. Such changes include moving trees and leaves, flowing water, sprinklers, fountains, and video screens (billboards), to name just a few typical examples. Other forms of change are weather changes, such as rain and snow, and illumination changes, such as turning the lights in a room on and off, or changes in daylight. We refer to this type of background as a dynamic background (DBG), while a background without changes, or with only slight changes, is referred to as a static background (SBG).

In this paper, we present a new method for capturing the background. It is based on the application of the diffusion bases (DB) algorithm. Moreover, we develop a real-time iterative method for background subtraction that separates background pixels from foreground pixels while coping with changes in the background. The main steps of the algorithm are:

• Extract the background frame by dimensionality reduction via the application of the DB 
algorithm. 

• Subtract the background from the input sequence. 

• Threshold the subtracted sequence. 

• Detect the foreground objects by applying a depth-first search (DFS).

We propose two versions of the algorithm: one for static backgrounds and one for dynamic backgrounds. To handle a dynamic background, a learning process is applied to data that contains only the background objects in order to generate a frame that captures the DBG. The proposed algorithm outperforms current state-of-the-art algorithms.

The rest of this paper is organized as follows. Section 2 presents related algorithms for background subtraction. Section 3 presents the diffusion bases (DB) algorithm. The main algorithm, called the background subtraction algorithm using diffusion bases (BSDB), is presented in Section 4. Section 5 presents experimental results and a performance analysis of the BSDB algorithm, and compares it to other background subtraction algorithms.

2 Related work 

Background subtraction is a widely used approach for detecting moving objects in video sequences captured by static cameras. This approach detects moving objects by computing the difference between the current frame and a reference frame, often called the background frame or background model. In order to extract the objects of interest, a threshold can be applied to the subtracted frame. The background frame should faithfully represent the scene and should not contain moving objects. In addition, it must be regularly updated in order to adapt to varying conditions such as illumination and geometry changes. This section reviews current state-of-the-art background subtraction techniques. These techniques range from simple approaches, which aim to maximize speed and minimize memory requirements, to more sophisticated approaches, which aim to achieve the highest possible accuracy under any possible circumstances. The goal of all these approaches is to run in real time. Additional references can be found in [1, 2, 3].

Temporal median filter - 

In [4], it was proposed to use the median value of the last n frames as the background model. This provides an adequate background model even if the n frames are subsampled with respect to the original frame rate by a factor of 10 [5]. The median filter is computed on a special set of values that contains the last n subsampled frames and the last computed median value. This combination increases the stability of the background model [5].

A fundamental shortcoming of the median-based approach is the need to store the recent pixel values in order to compute the median. Moreover, the median filter cannot be described by rigorous statistics and does not provide a deviation measure with which the subtraction threshold can be adapted.
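The temporal-median background can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of [4] or [5]: frames are assumed to be grayscale 2-D arrays stacked along axis 0, and the helper names `median_background` and `subsample` are hypothetical. Passing the previous median as `prev_median` adds it to the set of values, as in the stabilized variant described above.

```python
import numpy as np

def median_background(frames, prev_median=None):
    """Per-pixel median over a stack of frames with shape (n, H, W).

    If prev_median is given, it is appended to the set of values,
    mirroring the stabilized variant that also uses the last median.
    """
    stack = np.asarray(frames, dtype=np.float64)
    if prev_median is not None:
        stack = np.concatenate([stack, prev_median[None, ...]], axis=0)
    return np.median(stack, axis=0)

def subsample(frames, factor=10):
    """Keep every `factor`-th frame (hypothetical helper)."""
    return frames[::factor]

# Usage: a static scene with an object that covers pixel (1, 1) in one frame only.
rng = np.random.default_rng(0)
scene = rng.integers(0, 200, size=(4, 4)).astype(np.float64)
frames = np.repeat(scene[None], 7, axis=0)
frames[3, 1, 1] = 255.0                # transient foreground value
bg = median_background(frames)
print(np.array_equal(bg, scene))       # True
```

The buffer of recent frames that this sketch keeps in memory is exactly the storage cost criticized in the paragraph above.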

Gaussian average - 

This approach models the background independently at each pixel location [6]. The model is based on ideally fitting a Gaussian probability density function (pdf) to the last $n$ pixel values. At each new frame at time $t$, a running average is computed by $\varphi_t = \alpha I_t + (1 - \alpha)\varphi_{t-1}$, where $I_t$ is the current frame, $\varphi_{t-1}$ is the previous average, and $\alpha$ is an empirical weight that is often chosen as a tradeoff between stability and quick update.

In addition to speed, the running average has the advantage of a low memory requirement. Instead of a buffer with the last $n$ pixel values, each pixel is classified using two parameters $(\varphi_t, \sigma_t)$, where $\sigma_t$ is the standard deviation. Let $p^t_{i,j}$ be the pixel at position $(i, j)$ at time $t$. $p^t_{i,j}$ is classified as a foreground pixel if $|p^t_{i,j} - \varphi_{t-1}| > k\sigma_t$. Otherwise, $p^t_{i,j}$ is classified as a background pixel.
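A minimal sketch of the running Gaussian average, assuming grayscale frames as NumPy arrays. The mean update follows the formula above; the variance update (the same exponential rule applied to the squared deviation) is an assumption, since the text only specifies the mean. The class name and the parameter values `alpha`, `k`, and `init_var` are illustrative.

```python
import numpy as np

class RunningGaussianBackground:
    """Per-pixel running Gaussian average (sketch).

    alpha: update weight; k: threshold in standard deviations.
    The variance update rule is an assumption; only the mean update is given in the text.
    """
    def __init__(self, first_frame, alpha=0.05, k=2.5, init_var=25.0):
        self.alpha = alpha
        self.k = k
        self.mean = np.asarray(first_frame, dtype=np.float64).copy()
        self.var = np.full_like(self.mean, init_var)

    def apply(self, frame):
        """Update the model with `frame` and return a boolean foreground mask."""
        frame = np.asarray(frame, dtype=np.float64)
        diff = np.abs(frame - self.mean)                      # |p_t - phi_{t-1}|
        a = self.alpha
        self.mean = a * frame + (1 - a) * self.mean           # phi_t
        self.var = a * (frame - self.mean) ** 2 + (1 - a) * self.var  # sigma_t^2 (assumed)
        return diff > self.k * np.sqrt(self.var)              # True = foreground

model = RunningGaussianBackground(np.full((3, 3), 100.0))
frame = np.full((3, 3), 100.0)
frame[1, 1] = 200.0
mask = model.apply(frame)
print(mask.sum())  # 1
```

Only two arrays (mean and variance) are stored per pixel, which is the memory advantage noted above.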

Mixture of Gaussians - 

In order to cope with rapid changes in the background, a multi-valued background model was suggested in [7]. In this model, the probability of observing a certain pixel value $x$ at time $t$ is represented by a mixture of $k$ Gaussian distributions: $P(x_t) = \sum_{i=1}^{k} w_{i,t}\,\eta(x_t, \varphi_{i,t}, \Sigma_{i,t})$, where, for the $i$th Gaussian in the mixture at time $t$, $w$ estimates what portion of the data is accounted for by this Gaussian, $\varphi$ is the mean value, $\Sigma$ is the covariance matrix, and $\eta$ is a Gaussian probability density function. In practice, $k$ is set between 3 and 5.

Each of the $k$ Gaussian distributions describes only one of the observable background or foreground objects. The distributions are ranked according to the ratio between their peak amplitude $w_i$ and their standard deviation $\sigma_i$. Let $Th$ be a threshold value. The first $B$ distributions, where $B$ is the smallest number that satisfies $\sum_{i=1}^{B} w_i > Th$, are accepted as background. All the other distributions are considered foreground.

Let $I_t$ be a frame at time $t$. At each frame $I_t$, two events take place simultaneously: the new observed value $x_t$ is assigned to the best matching distribution, and the model parameters are updated. The distributions are ranked, and the first one that satisfies $|x_t - \varphi_{i,t}| / \sigma_{i,t} < 2.5$ is a match for $x_t$.
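The ranking, background selection, and matching rules above can be illustrated for a single grayscale pixel. This is a simplified sketch under stated assumptions, not the exact method of [7]: the variance is scalar, the learning rate `rho` is taken equal to `lr` (the original weights it by the Gaussian likelihood), an unmatched value replaces the least probable component, and all parameter values are illustrative.

```python
import numpy as np

class PixelMoG:
    """Mixture of k Gaussians for one grayscale pixel (simplified sketch)."""
    def __init__(self, k=3, lr=0.01, th=0.7, init_var=900.0):
        self.w = np.full(k, 1.0 / k)
        self.mu = np.linspace(0.0, 255.0, k)
        self.var = np.full(k, init_var)
        self.lr, self.th, self.init_var = lr, th, init_var

    def _background_set(self):
        order = np.argsort(-self.w / np.sqrt(self.var))          # rank by w / sigma
        cum = np.cumsum(self.w[order])
        b = int(np.searchsorted(cum, self.th, side="right")) + 1  # smallest B with sum > Th
        return order, set(order[:b].tolist())

    def update(self, x):
        """Update the mixture with value x; return True if x is foreground."""
        order, bg = self._background_set()
        match = None
        for i in order:                       # first ranked Gaussian within 2.5 sigma
            if abs(x - self.mu[i]) / np.sqrt(self.var[i]) < 2.5:
                match = i
                break
        self.w *= (1 - self.lr)
        if match is None:                     # replace the least probable component
            j = int(np.argmin(self.w / np.sqrt(self.var)))
            self.mu[j], self.var[j], self.w[j] = x, self.init_var, self.lr
        else:
            self.w[match] += self.lr
            rho = self.lr                     # simplified learning rate (assumption)
            self.mu[match] += rho * (x - self.mu[match])
            self.var[match] += rho * ((x - self.mu[match]) ** 2 - self.var[match])
        self.w /= self.w.sum()
        return match is None or match not in bg

m = PixelMoG()
for _ in range(300):
    m.update(100.0)
seen_bg = m.update(100.0)   # False: matches the dominant background Gaussian
new_fg = m.update(250.0)    # True: matches a low-weight Gaussian
```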

Kernel density estimation (KDE) - 

This approach models the background distribution by a non-parametric model that is based on a kernel density estimation (KDE) of the buffer of the last $n$ background values [8]. KDE provides a smooth, continuous version of the histogram of the most recent values that are classified as background values. This smoothed histogram is used to approximate the background pdf.

The background pdf is given as a sum of Gaussian kernels centered at the $n$ most recent background values $x_i$: $P(x_t) = \frac{1}{n}\sum_{i=1}^{n} \eta(x_t - x_i, \Sigma_t)$, where $\eta$ is the kernel estimator function and $\Sigma_t$ represents the kernel function bandwidth. $\Sigma$ is estimated by computing the median absolute deviation over the sample of consecutive intensity values of the pixel. Each Gaussian describes just one data sample. The buffer of background values is selectively updated in FIFO order for each new frame $I_t$.

In this application, two similar models are used concurrently, one for long-term memory and the other for short-term memory. The long-term model is updated using a blind update mechanism that prevents incorrect classification of background pixels.
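A minimal sketch of the KDE estimate for one pixel, assuming a grayscale buffer of recent background values. The bandwidth uses the median absolute deviation of consecutive samples; the conversion constant `0.68 * sqrt(2)` is the one commonly associated with this model and should be treated as an assumption here, as should the foreground probability threshold.

```python
import numpy as np

def kde_bandwidth(samples):
    """Kernel bandwidth from the median absolute deviation of consecutive samples.

    sigma = m / (0.68 * sqrt(2)) is an assumed conversion constant.
    """
    m = np.median(np.abs(np.diff(samples)))
    return max(m / (0.68 * np.sqrt(2.0)), 1e-3)   # guard against a zero bandwidth

def kde_probability(x, samples, sigma):
    """P(x) = (1/n) * sum_i N(x - x_i; 0, sigma^2) with Gaussian kernels."""
    samples = np.asarray(samples, dtype=np.float64)
    z = (x - samples) / sigma
    return np.mean(np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi)))

buffer = np.array([100, 102, 99, 101, 100, 98, 103, 100], dtype=np.float64)
sigma = kde_bandwidth(buffer)
p_bg = kde_probability(101.0, buffer, sigma)
p_fg = kde_probability(180.0, buffer, sigma)
is_foreground = p_fg < 1e-4   # illustrative threshold
```

Because each kernel is centered on one stored sample, the model adapts its shape to the data instead of assuming a fixed number of modes.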

Sequential kernel density approximation - 
Mean-shift vector techniques have been proven to be an effective tool for solving a variety of pattern recognition problems, e.g., tracking and segmentation [9, 10]. One of the main advantages of these techniques is their ability to directly detect the main modes of the pdf from the sample data while relying on a minimal set of assumptions. Unfortunately, the computational cost of this approach is very high. As such, it is not immediately applicable to modeling background pdfs at the pixel level.

To solve this problem, computational optimizations are used to mitigate the high computational cost [11]. Moreover, in [12] the mean-shift vector is used only for off-line model initialization, i.e., the initial set of Gaussian modes of the background pdf is detected from an initial sample set. The real-time model is updated by simple heuristics that handle mode adaptation, creation, and merging.

Co-occurrence of image variations - 

This method exploits spatial co-occurrences of image variations [13]. It assumes that neighboring blocks of pixels that belong to the background should have similar variations over time. The disadvantage of this method is that it does not handle blocks at the borders of distinct background objects.

This method divides each frame into distinct blocks of $N \times N$ pixels, where each block is regarded as an $N^2$-component vector. This trades resolution for higher speed and better stability. During the learning phase, a certain number of samples is acquired at a set of points for each block. The temporal average is computed, and the differences between the samples and the average, called the image variations, are calculated. Then the $N^2 \times N^2$ covariance matrix is computed with respect to the average. An eigenvector transformation is applied to reduce the dimension of the image variations.

For each block $b$, a classification phase is performed: the corresponding current eigen-image-variations are computed on a neighboring block of $b$. Then the image variation is expressed as a linear interpolation of its $L$ nearest neighbors in the eigenspace. The same interpolation coefficients are applied to the values of $b$ to provide an estimate of its current eigen-image-variations.
Eigen-backgrounds -

This approach is based on an eigen-decomposition of the whole image [14]. During a learning phase, $n$ sample images are acquired. The average image is then computed and subtracted from all the images. The covariance matrix is computed, and the best eigenvectors (those with the largest eigenvalues) are stored in an eigenvector matrix. For each frame $I$, a classification phase is executed: $I$ is projected onto the eigenspace and then projected back onto the image space. The output is the background frame, which does not contain small moving objects. A threshold is applied to the difference between $I$ and the background frame.
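The eigen-background procedure can be sketched with NumPy's SVD, assuming grayscale training images of equal size. The right singular vectors of the mean-centered data matrix are the eigenvectors of its covariance matrix, ordered by decreasing eigenvalue, so the covariance matrix is never formed explicitly. The function names, the number of components, and the threshold are illustrative assumptions.

```python
import numpy as np

def learn_eigenbackground(samples, n_components):
    """samples: (n, H, W) training images. Returns the mean image and top eigenvectors."""
    n, h, w = samples.shape
    X = samples.reshape(n, h * w).astype(np.float64)
    mean = X.mean(axis=0)
    # Rows of Vt are covariance eigenvectors, ordered by decreasing eigenvalue.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def eigen_foreground(frame, mean, basis, threshold):
    """Project onto the eigenspace and back, then threshold the difference."""
    v = frame.reshape(-1).astype(np.float64) - mean
    background = mean + basis.T @ (basis @ v)
    return np.abs(frame.reshape(-1) - background).reshape(frame.shape) > threshold

# Usage: training images differ only in global brightness; the test frame adds a small object.
rng = np.random.default_rng(0)
scene = rng.uniform(50.0, 150.0, size=(16, 16))
samples = np.stack([g * scene for g in np.linspace(0.8, 1.2, 10)])
mean, basis = learn_eigenbackground(samples, n_components=1)
frame = 1.1 * scene
frame[5:7, 5:7] += 100.0
mask = eigen_foreground(frame, mean, basis, threshold=30.0)
```

The small object is not well represented by the learned eigenspace, so it survives in the difference while the brightness change is absorbed by the projection.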

3 Dimensionality reduction 

Dimensionality reduction has been extensively researched. Classic techniques for dimensionality reduction, such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS), are simple to implement and can be efficiently computed. However, they are guaranteed to discover the true structure of a data set only when the data lies on or near a linear subspace of the high-dimensional input space [15]. These methods are highly sensitive to noise and outliers, since they take into account the distances between all pairs of points. Furthermore, PCA and MDS fail to detect non-linear structures.

More recent dimensionality reduction methods, such as Locally Linear Embedding (LLE) [16] and ISOMAP [17], address this shortcoming by considering, for each point, only the distances to its closest neighboring points in the data. Recently, Coifman and Lafon [18] introduced the Diffusion Maps (DM) algorithm, which is a manifold learning scheme. DM embeds high-dimensional data into a Euclidean space of substantially smaller dimension while preserving the geometry of the data set. The global geometry is preserved by maintaining the local neighborhood geometry of each point in the data set. DM uses a random-walk distance that is more robust to noise, since it averages over all the paths between a pair of points.

Diffusion Bases (DB), a dual algorithm to the DM algorithm, is described in (A. Schclar and A. Averbuch, "Segmentation and anomalies detection in hyper-spectral images via diffusion bases", preprint, 2008). The DB algorithm is dual to the DM algorithm in the sense that it explores the variability among the coordinates of the original data. Both algorithms share a graph Laplacian construction; however, the DB algorithm uses the Laplacian eigenvectors as an orthonormal system onto which it projects the original data.



3.1 Diffusion Bases (DB) 

This section reviews the DB algorithm for dimensionality reduction. Let $\Omega = \{x_i\}_{i=1}^{m}$, $x_i \in \mathbb{R}^n$, be a data set and let $x_i(j)$ denote the $j$th coordinate of $x_i$, $1 \le j \le n$. We define the vector $y_j = (x_1(j), \ldots, x_m(j))$ as the vector whose components are the $j$th coordinates of all the points in $\Omega$. The DB algorithm consists of the following steps:

• Construct the data set $\Omega' = \{y_j\}_{j=1}^{n}$.

• Build an undirected graph $G$ whose vertices correspond to $\Omega'$, with a non-negative and fast-decaying weight function $w_\varepsilon$ that corresponds to the local point-wise similarity between the points in $\Omega'$. By fast decay we mean that, given a scale parameter $\varepsilon > 0$, we have $w_\varepsilon(y_i, y_j) \to 0$ when $\|y_i - y_j\| \gg \varepsilon$ and $w_\varepsilon(y_i, y_j) \to 1$ when $\|y_i - y_j\| \ll \varepsilon$. One of the common choices for $w_\varepsilon$ is

$$w_\varepsilon(y_i, y_j) = \exp\left(-\frac{\|y_i - y_j\|^2}{\varepsilon}\right), \qquad (1)$$

where $\varepsilon$ defines a notion of neighborhood by defining an $\varepsilon$-neighborhood for every point $y_i$.

• Construct a random walk on the graph $G$ via a Markov transition matrix $P$. $P$ is the row-stochastic version of $w_\varepsilon$, which is derived by dividing each row of $w_\varepsilon$ by its sum.¹

• Perform an eigen-decomposition of $P$ to produce the left and the right eigenvectors of $P$: $\{\psi_k\}_{k=1,\ldots,n}$ and $\{\xi_k\}_{k=1,\ldots,n}$, respectively. Let $\{\lambda_k\}_{k=1,\ldots,n}$ be the eigenvalues of $P$, where $|\lambda_1| \ge |\lambda_2| \ge \ldots \ge |\lambda_n|$.

• The right eigenvectors of $P$ constitute an orthonormal basis $\{\xi_k\}_{k=1,\ldots,n}$, $\xi_k \in \mathbb{R}^n$. These eigenvectors capture the non-linear coordinate-wise variability of the original data.

• Next, we use the spectral decay property of the spectral decomposition to extract only the first $\eta$ eigenvectors $BS = \{\xi_k\}_{k=1,\ldots,\eta}$, which contain the non-linear directions with the highest variability of the coordinates of the original data set $\Omega$.

• We project the original data onto the basis $BS$. Let $\Omega_{BS}$ be the set of these projections: $\Omega_{BS} = \{g_i\}_{i=1}^{m}$, $g_i \in \mathbb{R}^{\eta}$, where $g_i = (x_i \cdot \xi_1, \ldots, x_i \cdot \xi_\eta)$, $i = 1, \ldots, m$, and $\cdot$ denotes the inner product. $\Omega_{BS}$ contains the coordinates of the original points in the orthonormal system whose axes are given by $BS$. Alternatively, $\Omega_{BS}$ can be interpreted in the following way: the coordinates of $g_i$ contain the correlation between $x_i$ and the directions given by the vectors in $BS$.

¹ $P$ and the graph Laplacian $I - P$ (see [19]) share the same eigenvectors.
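The steps above can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' implementation: the default scale `eps` is the median non-zero squared distance (a heuristic), and instead of the right eigenvectors of $P$ the sketch uses the eigenvectors of the symmetric conjugate $D^{-1/2} W D^{-1/2}$, which has the same eigenvalues as $P$ and yields an orthonormal system, as the projection step requires. The function name `diffusion_basis` is hypothetical.

```python
import numpy as np

def diffusion_basis(X, eta, eps=None):
    """Sketch of the DB algorithm.

    X   : (m, n) array whose row i is the data point x_i in R^n.
    eta : number of basis vectors to keep.
    eps : scale of Eq. (1); if None, the median non-zero squared distance
          is used (a heuristic assumption).
    Returns (Omega_BS, BS): the (m, eta) projections g_i and the (n, eta) basis.
    """
    X = np.asarray(X, dtype=np.float64)
    Y = X.T                                              # Omega': row j is y_j
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    if eps is None:
        pos = d2[d2 > 0]
        eps = float(np.median(pos)) if pos.size else 1.0
    W = np.exp(-d2 / eps)                                # Eq. (1)
    deg = W.sum(axis=1)                                  # P = D^{-1} W has unit row sums
    # Symmetric conjugate of P: same eigenvalues, orthonormal eigenvectors.
    A = W / np.sqrt(np.outer(deg, deg))
    lam, V = np.linalg.eigh(A)
    order = np.argsort(-np.abs(lam))                     # descending |lambda_k|
    BS = V[:, order[:eta]]
    if BS[:, 0].sum() < 0:                               # eigenvector signs are arbitrary
        BS[:, 0] *= -1
    return X @ BS, BS                                    # g_i = (x_i . xi_1, ..., x_i . xi_eta)

rng = np.random.default_rng(1)
X = rng.random((50, 6))            # 50 points in R^6
G, BS = diffusion_basis(X, eta=3)
print(G.shape)                     # (50, 3)
```

Note that the graph has one vertex per coordinate ($n$ vertices), so the eigenproblem size depends on the dimension of the data, not on the number of points.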

A summary of the DB procedure is given in Algorithm 1. An enhancement of the spectral decomposition is described in (A. Schclar and A. Averbuch, "Segmentation and anomalies detection in hyper-spectral images via diffusion bases", preprint, 2008).



4 The Background Subtraction Algorithm using Diffusion Bases (BSDB)

In this section we present the BSDB algorithm. The algorithm has two versions:

Static background subtraction using DB (SBSDB): We assume that the background is static (SBG); see Section 4.1. The video sequence is captured on-line. Gray-level images are sufficient for the processing.

Dynamic background subtraction using DB (DBSDB): We assume that the background is moving (DBG); see Section 4.2. This algorithm uses off-line (training) and on-line (detection) procedures. As opposed to the SBSDB, this algorithm requires color (RGB) frames.

We assume that in both algorithms the camera is static.



4.1 Static background subtraction algorithm using DB (SBSDB) 

In this section we describe the on-line algorithm that is applied to a video sequence captured by a static camera. We assume that the background is static. The SBSDB algorithm captures the static background, subtracts it from the video sequence, and segments the subtracted output.
Algorithm 1 The Diffusion Basis algorithm.

DiffusionBasis($\Omega'$, $w_\varepsilon$, $\varepsilon$, $\eta$)

1. Calculate the weight function $w_\varepsilon(y_i, y_j)$, $i, j = 1, \ldots, n$ (Eq. 1).

2. Construct a Markov transition matrix $P$ by normalizing the sum of each row in $w_\varepsilon$ to be 1:

$$p(y_i, y_j) = \frac{w_\varepsilon(y_i, y_j)}{d(y_i)}, \quad \text{where } d(y_i) = \sum_{j=1}^{n} w_\varepsilon(y_i, y_j).$$

3. Perform a spectral decomposition of $P$:

$$p(y_i, y_j) = \sum_{k=1}^{n} \lambda_k\, \xi_k(y_i)\, \psi_k(y_j),$$

where the left and the right eigenvectors of $P$ are given by $\{\psi_k\}$ and $\{\xi_k\}$, respectively, and $\{\lambda_k\}$ are the eigenvalues of $P$ in descending order of magnitude.

4. Let $BS \leftarrow \{\xi_k\}_{k=1}^{\eta}$.

5. Project the original data onto the orthonormal system $BS$:

$$\Omega_{BS} = \{g_i\}_{i=1}^{m}, \quad g_i \in \mathbb{R}^{\eta},$$

where $g_i = (x_i \cdot \xi_1, \ldots, x_i \cdot \xi_\eta)$, $i = 1, \ldots, m$, $\xi_k \in BS$, $1 \le k \le \eta$, and $\cdot$ is the inner product.

6. return $\Omega_{BS}$.

The input to the algorithm is a sequence of video frames in gray-level format. The algorithm produces a binary mask for each video frame. The pixels in the binary mask that belong to the background are assigned the value 0, while the other pixels are assigned the value 1.
4.1.1 Off-line algorithm for capturing static background

In order to capture the static background of a scene, we reduce the dimensionality of the input sequence by applying the DB algorithm (Algorithm 1 in Section 3.1). The input to the algorithm consists of $n$ frames that form a datacube.

Formally, let $D_n = \{s^t_{i,j} \mid i, j = 1, \ldots, N,\ t = 1, \ldots, n\}$ be the input datacube of $n$ frames, each of size $N \times N$, where $s^t_{i,j}$ is the pixel at position $(i, j)$ in the video frame at time $t$. We define the vector $P_{i,j} = (s^1_{i,j}, \ldots, s^n_{i,j})$ to be the values of the $(i, j)$th coordinate in all the $n$ frames in $D_n$. This vector is referred to as a hyperpixel. Let $\Omega_n = \{P_{i,j}\}$, $i, j = 1, \ldots, N$, be the set of all hyperpixels. We define $F_t = (s^t_{1,1}, \ldots, s^t_{N,N})$ to be a 1-D vector representing the video frame at time $t$. We refer to $F_t$ as a frame-vector. Let $\Omega'_n = \{F_t\}_{t=1}^{n}$ be the set of all frame-vectors.

We apply the DB algorithm to $\Omega_n$ in $\Omega_{BS} = \mathrm{DiffusionBasis}(\Omega'_n, w_\varepsilon, \varepsilon, \eta)$, where $w_\varepsilon$ is defined by Eq. (1), and $\varepsilon$ and $\eta$ are defined in Section 3.1 (see Algorithm 1). The output is the projection of every hyperpixel on the diffusion basis, which embeds the original data $D_n$ into a reduced space. The first vector of $\Omega_{BS}$, i.e., the vector formed by the first coordinates of the projections of all the hyperpixels, represents the background of the input frames. Let $bg_V = (x_i)$, $i = 1, \ldots, N^2$, be this vector. We reshape $bg_V$ into the matrix $bg_M = (x_{i,j})$, $i, j = 1, \ldots, N$. Then, $bg_M$ is normalized to lie between 0 and 255. The normalized background is denoted by $\overline{bg}_M$.
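The off-line procedure above can be sketched for a small datacube, assuming grayscale frames of equal size. Only the first diffusion-basis vector is needed, so the sketch computes just that one. As in any DB sketch here, two choices are assumptions rather than part of the description: the default scale `eps` (median non-zero squared distance between frame-vectors) and the use of the symmetric conjugate of $P$ to obtain an orthonormal eigenvector. The function name `static_background` is hypothetical.

```python
import numpy as np

def static_background(frames, eps=None):
    """Off-line static background of an (n, N, N) datacube (sketch).

    Rows of F are the frame-vectors F_t (the set Omega'_n); columns of F are
    the hyperpixels P_ij. Only the first diffusion-basis vector is computed.
    """
    n, h, w = frames.shape
    F = frames.reshape(n, h * w).astype(np.float64)
    d2 = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=2)  # n x n (n is small)
    if eps is None:                                   # scale heuristic (assumption)
        pos = d2[d2 > 0]
        eps = float(np.median(pos)) if pos.size else 1.0
    W = np.exp(-d2 / eps)                             # Eq. (1) on frame-vectors
    deg = W.sum(axis=1)
    A = W / np.sqrt(np.outer(deg, deg))               # symmetric conjugate of P (assumption)
    lam, V = np.linalg.eigh(A)
    xi1 = V[:, np.argmax(np.abs(lam))]
    if xi1.sum() < 0:                                 # eigenvector sign is arbitrary
        xi1 = -xi1
    bg_v = F.T @ xi1                                  # first coordinate of every hyperpixel
    bg_m = bg_v.reshape(h, w)                         # reshape bg_V into bg_M
    lo, hi = bg_m.min(), bg_m.max()
    if hi == lo:
        return np.zeros_like(bg_m)
    return 255.0 * (bg_m - lo) / (hi - lo)            # normalize to [0, 255]

rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 200.0, size=(8, 8))
frames = np.repeat(scene[None], 6, axis=0)
for t in range(6):
    frames[t, t:t + 2, t:t + 2] = 255.0               # small object moving along the diagonal
bg = static_background(frames)
corr = np.corrcoef(bg.ravel(), scene.ravel())[0, 1]
flat = np.repeat(scene[None], 5, axis=0)
expected = 255.0 * (scene - scene.min()) / (scene.max() - scene.min())
```

When all frames are identical, the sketch returns exactly the normalized scene, which is a useful sanity check.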

4.1.2 On-line algorithm for capturing a static background 

In order to make the algorithm suitable for on-line applications, the incoming video sequence is processed using a sliding window (SW) of size $m$. Thus, the number of frames that are input to the algorithm is $m$. Naturally, we seek to minimize $m$ in order to obtain results from the algorithm faster. We found empirically that the algorithm produces good results for values of $m$ as low as 5, 6, and 7. A delay of 5 to 7 frames is negligible and makes the algorithm suitable for on-line applications.

Let $S = (s_1, s_2, \ldots, s_m, s_{m+1}, \ldots, s_n)$ be the input video sequence. We apply the algorithm described in Section 4.1.1 to every SW. The output is a sequence of background frames

$$BG = \left((\overline{bg}_M)_1, \ldots, (\overline{bg}_M)_m, (\overline{bg}_M)_{m+1}, \ldots, (\overline{bg}_M)_n\right), \qquad (2)$$

where $(\overline{bg}_M)_i$ is the background that corresponds to frame $s_i$, and $(\overline{bg}_M)_{n-m+2}, \ldots, (\overline{bg}_M)_n$ are equal to $(\overline{bg}_M)_{n-m+1}$. Figure 1 describes how the SW is shifted.
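A minimal sketch of how Eq. (2) is assembled from the sliding windows, assuming the frames are held in a NumPy array. `background_fn` is a hypothetical parameter standing in for the off-line procedure of Section 4.1.1; the usage below passes a plain temporal mean only to keep the example self-contained, not as the DB-based background.

```python
import numpy as np

def sliding_window_backgrounds(frames, m, background_fn):
    """Return one background per frame, following Eq. (2) (sketch).

    Window W_i = (s_i, ..., s_{i+m-1}) gives the background of s_i for
    i = 1, ..., n-m+1; the last m-1 frames reuse the last window's result.
    """
    n = len(frames)
    if n < m:
        raise ValueError("need at least m frames")
    bgs = [background_fn(frames[i:i + m]) for i in range(n - m + 1)]
    bgs.extend([bgs[-1]] * (m - 1))      # (bg)_{n-m+2}, ..., (bg)_n = (bg)_{n-m+1}
    return bgs

frames = np.arange(10, dtype=np.float64)[:, None, None] * np.ones((10, 2, 2))
bgs = sliding_window_backgrounds(frames, m=5, background_fn=lambda w: w.mean(axis=0))
```

With `m = 5`, a background becomes available for frame $s_i$ once $s_{i+4}$ has arrived, which is the 5-frame delay discussed above.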



11 



1 • • 


S(m+1) 


Sm 


• • • 


S3 


S2 


S1 



Wi size m 
W2 size m 



T 



DB output for Wi 



DB output for W2 



Figure 1: Illustration of how the SW is shifted. $W_1 = (s_1, \ldots, s_m)$ is the SW for $s_1$. $W_2 = (s_2, \ldots, s_{m+1})$ is the SW for $s_2$, etc. The backgrounds of $s_i$ and $s_{i+1}$ are denoted by $(\overline{bg}_M)_i$ and $(\overline{bg}_M)_{i+1}$, $i = 1, \ldots, n - m + 1$, respectively.

Using the SW also speeds up the execution of the DB algorithm. The weight function $w_\varepsilon$ (Eq. 1) is not recalculated for all the frames in the SW. Instead, $w_\varepsilon$ is only updated according to the new frame that enters the SW and the one that exits it. Specifically, let $W_t = (s_t, \ldots, s_{t+m-1})$ be the SW at time $t$ and let $W_{t+1} = (s_{t+1}, \ldots, s_{t+m})$ be the SW at time $t + 1$. At time $t + 1$, $w_\varepsilon$ is calculated only for $s_{t+m}$, and the entries that correspond to $s_t$ are removed from $w_\varepsilon$.
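The incremental update of $w_\varepsilon$ can be sketched as follows, assuming the weights are kept as a dense $m \times m$ NumPy matrix ordered from the oldest to the newest frame. The class name `SlidingWeights` and the fixed `eps` are illustrative. Each call drops the row and column of the exiting frame and computes only the entries of the entering frame; the usage checks the result against a full recomputation.

```python
from collections import deque
import numpy as np

class SlidingWeights:
    """Maintain w_eps over the frames of a sliding window incrementally (sketch)."""
    def __init__(self, m, eps):
        self.m, self.eps = m, eps
        self.frames = deque()
        self.W = np.zeros((0, 0))

    def push(self, frame):
        f = np.asarray(frame, dtype=np.float64).ravel()
        if len(self.frames) == self.m:            # s_t leaves: drop its row and column
            self.frames.popleft()
            self.W = self.W[1:, 1:]
        self.frames.append(f)
        # Only the entries involving the entering frame s_{t+m} are computed.
        new = np.array([np.exp(-np.sum((f - g) ** 2) / self.eps) for g in self.frames])
        k = len(self.frames)
        W = np.empty((k, k))
        W[:k - 1, :k - 1] = self.W
        W[k - 1, :] = new
        W[:, k - 1] = new
        self.W = W
        return self.W

rng = np.random.default_rng(2)
frames = rng.random((7, 3, 3))
sw = SlidingWeights(m=4, eps=1.0)
for f in frames:
    W = sw.push(f)
last = frames[-4:].reshape(4, -1)
full = np.exp(-np.sum((last[:, None, :] - last[None, :, :]) ** 2, axis=2) / 1.0)
```

Each step costs $O(m)$ distance evaluations instead of $O(m^2)$.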

4.1.3 The SBSDB algorithm 



The SBSDB on-line algorithm captures the background of each SW according to Section 4.1.2. Then it subtracts the background from the input sequence and thresholds the output to get the background binary mask.

Let $S = (s_1, \ldots, s_n)$ be the input sequence. For each frame $s_i \in S$, $i = 1, \ldots, n$, we do the following:

• Let $W_i = (s_i, \ldots, s_{i+m-1})$ be the SW of $s_i$. The on-line algorithm for capturing the background (Section 4.1.2) is applied to $W_i$. The output is the background frame $(\overline{bg}_M)_i$.

• The SBSDB algorithm subtracts $(\overline{bg}_M)_i$ from the original input frame by $\tilde{s}_i = s_i - (\overline{bg}_M)_i$. Then, each pixel in $\tilde{s}_i$ that has a negative value is set to 0.

• A threshold is applied to $\tilde{s}_i$. The threshold is computed as described in Section 4.1.4. For $k, l = 1, \ldots, N$, the output is defined as follows:

$$\tilde{s}_i(k,l) = \begin{cases} 0, & \text{if it is a background pixel;} \\ 1, & \text{otherwise.} \end{cases}$$
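The per-frame subtraction, clipping, and thresholding steps can be sketched as follows, assuming grayscale frames and backgrounds of equal shape. The threshold `th` is passed in directly here; in the algorithm it comes from the histogram procedure of Section 4.1.4. The names `sbsdb_mask` and `sbsdb` are hypothetical.

```python
import numpy as np

def sbsdb_mask(frame, background, th):
    """Binary mask for one frame: 1 = foreground, 0 = background (sketch)."""
    diff = np.asarray(frame, dtype=np.float64) - np.asarray(background, dtype=np.float64)
    diff[diff < 0] = 0.0                       # negative values are set to 0
    return (diff >= th).astype(np.uint8)       # 0 if diff < Th, 1 otherwise

def sbsdb(frames, backgrounds, th):
    """Apply the mask step to every frame with its own SW background."""
    return [sbsdb_mask(s, bg, th) for s, bg in zip(frames, backgrounds)]

frame = np.full((4, 4), 50.0)
frame[2, 2] = 200.0
background = np.full((4, 4), 60.0)
mask = sbsdb_mask(frame, background, th=30.0)
masks = sbsdb([frame, frame], [background, background], 30.0)
```

Clipping negative differences means that only pixels brighter than the background can be detected by this one-sided test.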
Figure 2: An example of how to use the histogram $h$ to find the threshold value. $Th$ is set to $x$ since $h'(x) > -\mu$.

4.1.4 Threshold computation for a grayscale input 

The threshold $Th$, which separates background pixels from foreground pixels, is calculated in the last step of the SBSDB algorithm. The SBSDB algorithm subtracts the background from the input frame and sets pixels with negative values to zero. As a result, the histogram of a frame after subtraction is usually high at small values and low at high values. The SBSDB algorithm smooths the histogram in order to compute the threshold value accurately.

Let $h$ be the histogram of a frame and let $\mu$ be a given parameter that provides a threshold for the slope of $h$; $\mu$ is chosen to be the magnitude of the slope where $h$ becomes moderate. We scan $h$ from its global maximum to the right. We set the threshold $Th$ to be the smallest value of $x$ that satisfies $h'(x) > -\mu$, where $h'(x)$ is the first derivative of $h$ at point $x$, i.e., the slope of $h$ at $x$. Since $h$ decreases to the right of its maximum, this is the first point at which the magnitude of the slope drops below $\mu$. The background/foreground classification of the pixels in the input frame $\tilde{s}_i$ is determined according to $Th$. Specifically, for $k, l = 1, \ldots, N$,

$$\tilde{s}_i(k,l) = \begin{cases} 0, & \text{if } \tilde{s}_i(k,l) < Th; \\ 1, & \text{otherwise.} \end{cases}$$

Fig. 2 illustrates how to find the threshold.

4.2 Dynamic background subtraction algorithm using DB (DBSDB)

In this section, we describe an on-line algorithm that handles video sequences captured by a static camera. We assume that the background is dynamic (moving). The DBSDB applies an off-line procedure that captures the dynamic background and an on-line background subtraction algorithm. In addition, the DBSDB algorithm segments the video sequence after the background subtraction is completed.

Figure 3: The inputs to the DBSDB algorithm. The training is done once on the BGD. It produces the background, which is input to the DBSDB. The RTD is the on-line input to the DBSDB.

The input to the algorithm consists of two components: 

Background training data: A video sequence of the scene without foreground objects. This training data can be obtained from the frames at the beginning of the video sequence. This sequence is referred to as the background data (BGD).

Data for classification: A video sequence that contains background and foreground objects. The 
classification of the objects is performed on-line. We refer to this sequence as the real-time 
data (RTD). 

For both input components, the video frames are assumed to be in RGB format (see Fig. 3).

The algorithm is applied to every video frame, and a binary mask is constructed in which the pixels that belong to the background are set to 0 while the foreground pixels are set to 1.

4.2.1 Iterative method for capturing a dynamic background - training 



The algorithm described in Section 4.1.3 does not handle ongoing changes in the background well, such as illumination differences between frames, moving leaves, flowing water, etc. In the following, we present a method that copes with such background changes. An iterative procedure is applied to the BGD in order to capture the movements in the scene. This procedure constitutes the training step of the algorithm.

Let $B = (b_1, \ldots, b_m)$ be the BGD input sequence and let $bg_M^{final}$ be the output background frame. $bg_M^{final}$ is initialized to zeros. Each iteration contains the following steps:

• Apply the off-line algorithm (Section 4.1.1) in order to capture the static background of $B$. The BGD is treated as a single sliding window of length $m$. The output consists of the background frames $bg_M$ and $\overline{bg}_M$, where $\overline{bg}_M$ is the normalization of $bg_M$.

• $\overline{bg}_M$ is subtracted from each frame in $B$ by $\tilde{b}_j = b_j - \overline{bg}_M$, $j = 1, \ldots, m$. In case the input is in grayscale format, we set to zero each pixel in $\tilde{b}_j$ that has a negative value. The output is the sequence $\tilde{B} = (\tilde{b}_1, \ldots, \tilde{b}_m)$.

• $\overline{bg}_M$ is added to $bg_M^{final}$ by $bg_M^{final} = bg_M^{final} + \overline{bg}_M$.

• $\tilde{B}$ is the input for the next iteration: $B = \tilde{B}$.

The iterative process stops when a given number of pixels in $B$ are equal to or smaller than zero. Finally, $bg_M^{final}$ is normalized to lie between 0 and 255. The normalized background is denoted by $\overline{bg}_M^{final}$. The output of this process is composed of $bg_M^{final}$ and $\overline{bg}_M^{final}$.
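The training loop can be sketched as follows, assuming the BGD is an `(m, H, W)` NumPy array. `background_fn` is a hypothetical parameter that must return the normalized static background of the current sequence (Section 4.1.1 with the whole BGD as one window); the usage passes a normalized temporal mean only as a self-contained stand-in. The stop rule counts pixels that are less than or equal to zero across the whole sequence, and `max_iter` is an added safeguard that the text does not mention.

```python
import numpy as np

def normalize_0_255(a):
    lo, hi = a.min(), a.max()
    return np.zeros_like(a) if hi == lo else 255.0 * (a - lo) / (hi - lo)

def train_dynamic_background(B, background_fn, stop_count, grayscale=True, max_iter=50):
    """Iterative DBG capture on the BGD of shape (m, H, W) (sketch)."""
    B = np.asarray(B, dtype=np.float64).copy()
    bg_final = np.zeros(B.shape[1:])
    for _ in range(max_iter):                  # max_iter is an added safeguard
        bg = background_fn(B)                  # normalized static background of B
        B = B - bg                             # b~_j = b_j - bg, j = 1, ..., m
        if grayscale:
            B[B < 0] = 0.0
        bg_final += bg                         # accumulate into bg_M^final
        if np.count_nonzero(B <= 0) >= stop_count:
            break
    return bg_final, normalize_0_255(bg_final)

scene = np.array([[0.0, 255.0], [128.0, 64.0]])
B = np.repeat(scene[None], 5, axis=0)
stand_in = lambda b: normalize_0_255(b.mean(axis=0))   # hypothetical stand-in for 4.1.1
bg_final, bg_final_norm = train_dynamic_background(B, stand_in, stop_count=B.size)
```

For a perfectly static BGD the loop stops after one iteration, since the subtraction leaves every pixel at zero.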

4.2.2 The DBSDB algorithm 

In this section, we describe the DBSDB algorithm, which handles video sequences that contain a dynamic background. The DBSDB algorithm consists of a training phase, which is applied to the BGD (Section 4.2.1), and a classification phase, which is applied to the RTD. Both phases process grayscale and RGB versions of the input and generate grayscale and RGB outputs. The final phase combines the output of the grayscale classification phase with the output of the RGB classification phase.

Formally, let $S^{rgb} = (s^{rgb}_1, \ldots, s^{rgb}_n)$ and $B^{rgb} = (b^{rgb}_1, \ldots, b^{rgb}_m)$ be the RTD, which is the on-line captured video sequence, and the BGD, which is the off-line video sequence for the training phase (Section 4.2.1), respectively.



The DBSDB algorithm consists of the following: 
1. The grayscale training phase:

• Convert $B^{rgb}$ into grayscale format. The grayscale sequence is denoted by $B^g$.

• Apply the SBSDB algorithm to $B^g$, excluding the threshold computation, as was done in Section 4.1.3. The output is a sequence of background frames $\tilde{B}^g$.

• Capture the dynamic background (DBG) in $\tilde{B}^g$ (Section 4.2.1). The output is the background frame given by $(\overline{bg}_M^{final})^g$.
2. The RGB training phase:

Capture the DBG in each of the RGB channels of $B^{rgb}$ (Section 4.2.1). The output is the background frame denoted by $(\overline{bg}_M^{final})^{rgb}$.

3. The grayscale classification phase:

$S^{rgb}$ is converted into grayscale format. The grayscale sequence is denoted by $S^g$. The SBSDB algorithm is applied to $S^g$, excluding the threshold computation, as described in Section 4.1.3. This process is performed once or, in some cases, iteratively twice. The output is denoted by $\tilde{S}^g$.

For each frame $\tilde{s}^g_i \in \tilde{S}^g$, $i = 1, \ldots, n$, we do the following:

• $(\overline{bg}_M^{final})^g$ is subtracted from $\tilde{s}^g_i$ by $\tilde{s}^g_i = \tilde{s}^g_i - (\overline{bg}_M^{final})^g$. Then, each pixel in $\tilde{s}^g_i$ that has a negative value is set to 0.

• A threshold is applied to $\tilde{s}^g_i$. The threshold is computed as in Section 4.1.4. The output is set to:

$$\tilde{s}^g_i(k,l) = \begin{cases} 0, & \text{if it is a background pixel;} \\ 1, & \text{otherwise,} \end{cases}$$

for $k, l = 1, \ldots, N$.

4. The RGB classification phase:

For each frame $s^{rgb}_i \in S^{rgb}$, $i = 1, \ldots, n$, we do the following:

• $(\overline{bg}_M^{final})^{rgb}$ is subtracted from $s^{rgb}_i$ by $\tilde{s}^{rgb}_i = s^{rgb}_i - (\overline{bg}_M^{final})^{rgb}$.

• $\tilde{s}^{rgb}_i$ is normalized to lie between 0 and 255. The normalized frame is denoted by $\overline{s}^{rgb}_i$.

• A threshold is applied to $\overline{s}^{rgb}_i$. The thresholds are computed as described under "Threshold computation for RGB input" in Section 4.2.2. The output is set to:

$$\overline{s}^{rgb}_i(k,l) = \begin{cases} 0, & \text{if it is a background pixel;} \\ 1, & \text{otherwise,} \end{cases}$$

for $k, l = 1, \ldots, N$.



5. The DFS phase: 

This phase combines $\tilde{s}_i^{g}$ and $\hat{s}_i^{rgb}$ from the grayscale and RGB classification phases, respectively. Since $\tilde{s}_i^{g}$ contains false negative detections (not all the foreground objects are found) and $\hat{s}_i^{rgb}$ contains false positive detections (background pixels are classified as foreground pixels), we use each foreground pixel in $\tilde{s}_i^{g}$ as a reference point from which we begin the application of a DFS on $\hat{s}_i^{rgb}$ (see section 4.2.2).
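The subtraction and normalization steps of the two classification phases can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the SBSDB step and the thresholding are omitted, the function names `grayscale_residual` and `rgb_residual` are hypothetical, and the normalization to [0, 255] is assumed to be a single min-max rescaling over the whole frame (the text does not specify the rule).

```python
import numpy as np


def grayscale_residual(frame_g, bg_g):
    """Grayscale classification, first bullet: subtract the final grayscale
    background from the frame and set negative pixels to 0."""
    diff = frame_g.astype(np.float64) - bg_g.astype(np.float64)
    return np.clip(diff, 0.0, None)


def rgb_residual(frame_rgb, bg_rgb):
    """RGB classification, first two bullets: subtract the final RGB background
    and rescale the residual to [0, 255].

    Assumption: one min-max normalization over the whole frame, so a zero
    residual (a background pixel) lands between the extremes, near the middle
    when positive and negative residuals have comparable range.
    """
    diff = frame_rgb.astype(np.float64) - bg_rgb.astype(np.float64)
    lo, hi = diff.min(), diff.max()
    if hi == lo:  # constant residual: nothing to stretch
        return np.zeros_like(diff)
    return 255.0 * (diff - lo) / (hi - lo)
```

Because residuals can be negative as well as positive before normalization, background pixels tend to cluster in the interior of [0, 255], which matches the centered histogram used for the RGB thresholds.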



Threshold computation for RGB input. In the last step of the RGB classification phase of the DBSDB algorithm, the thresholds that separate background pixels from foreground pixels are computed for each of the RGB components. Because the DBSDB algorithm subtracts the background from the input frame, the histogram of a frame after the subtraction is high in the center and low at the right and left ends, where the center area corresponds to the background pixels. The DBSDB algorithm smooths the histogram in order to compute the threshold values accurately.

Let $h$ be the histogram and let $\mu > 0$ be a given parameter that provides a threshold for the slope of $h$; $\mu$ should be chosen to be the value of the slope at which $h$ becomes moderate. We denote the thresholds by $Th_r$ and $Th_l$. We scan $h$ from its global maximum to the right: $Th_r = x$, where $x$ is the first coordinate that satisfies $h'(x) > -\mu$ and $h'(x)$ denotes the first derivative of $h$ at point $x$, i.e. the slope of $h$ at $x$. We also scan $h$ from its global maximum to the left: $Th_l = y$, where $y$ is the first coordinate that satisfies $h'(y) < \mu$.

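The classification of the pixels in the input frame $\hat{s}_i^{rgb}$ is determined according to $Th_r$ and $Th_l$. For each color component and for each $k, l = 1, \ldots, N$,

$$\hat{s}_i^{rgb}(k,l) = \begin{cases} 0, & \text{if } Th_l < \hat{s}_i^{rgb}(k,l) < Th_r; \\ 1, & \text{otherwise.} \end{cases}$$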

See Fig. 4 for an example of how the thresholds are computed.

The process is executed three times, once for each of the RGB channels. The outputs are combined by a pixel-wise OR operation.
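The slope-based threshold search and the per-channel OR can be sketched as follows. This is a hedged illustration, not the authors' code: the 256-bin histogram, the moving-average smoothing of width `smooth`, the forward difference used for $h'$, and the rule that a coordinate only counts after the scan has passed a steep stretch (so a flat top at the maximum does not stop the scan immediately) are all assumptions. The names `slope_thresholds` and `classify_rgb` are hypothetical.

```python
import numpy as np


def slope_thresholds(values, mu, smooth=5):
    """Return (th_l, th_r) for one channel of a normalized residual frame.

    Assumptions not fixed by the text: a 256-bin histogram over [0, 256),
    moving-average smoothing of width `smooth`, forward differences as h',
    and a coordinate only counts once the scan has passed a steep stretch
    (|h'| >= mu), so a flat top at the maximum does not stop the scan.
    """
    h, _ = np.histogram(values, bins=256, range=(0, 256))
    h = np.convolve(h.astype(np.float64), np.ones(smooth) / smooth, mode="same")
    dh = np.diff(h)  # dh[x] = h[x + 1] - h[x], a discrete h'(x)
    peak = int(np.argmax(h))  # global maximum of the smoothed histogram

    th_r, steep = 255, False
    for x in range(peak, len(dh)):  # scan right: h is descending
        if dh[x] <= -mu:
            steep = True
        elif steep:  # first x with h'(x) > -mu after the drop
            th_r = x
            break

    th_l, steep = 0, False
    for y in range(peak - 1, -1, -1):  # scan left: h rises toward the peak
        if dh[y] >= mu:
            steep = True
        elif steep:  # first y with h'(y) < mu after the rise
            th_l = y
            break
    return th_l, th_r


def classify_rgb(frame, mu, smooth=5):
    """Pixels with th_l < value < th_r in a channel are background (0) for that
    channel; the three channel masks are combined with a pixel-wise OR."""
    mask = np.zeros(frame.shape[:2], dtype=bool)
    for c in range(frame.shape[2]):
        channel = frame[..., c]
        th_l, th_r = slope_thresholds(channel, mu, smooth)
        mask |= ~((channel > th_l) & (channel < th_r))
    return mask.astype(np.uint8)
```

As the text notes, the result depends on $\mu$; in this sketch $\mu$ is a plain argument with no automatic selection.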



1 



Th' 







Th r 



1 




h 



— I 1 

y x 

h ! (y)<u h : (x)>-M 



Figure 4: An example that uses the histogram h for finding the threshold values. 

Scan by depth-first search (DFS). The last phase of the DBSDB algorithm is the application of a DFS. Let $\tilde{s}_i^{g} \in \tilde{S}^{g}$ and $\hat{s}_i^{rgb}$ be the $i$th output frames of the grayscale and RGB classification phases, respectively. Each frame is a binary mask represented by a matrix. The DFS phase combines both outputs: $\tilde{s}_i^{g}$ contains false negative detections and $\hat{s}_i^{rgb}$ contains false positive detections. We use each foreground pixel in $\tilde{s}_i^{g}$ as a reference point from which we begin a DFS in $\hat{s}_i^{rgb}$. The goal is to find the connected components of the graph whose vertices are the pixels of $\hat{s}_i^{rgb}$ and whose edges are given by the 8-neighborhood of each pixel.

The graph is constructed as follows:

• A pixel $\hat{s}_i^{rgb}(k,l)$ is a root if $\tilde{s}_i^{g}(k,l)$ is a foreground pixel and it has not yet been classified as a foreground pixel by the algorithm.

• A pixel $\hat{s}_i^{rgb}(k,l)$ is a node if it is a foreground pixel and has not yet been marked as a root.

• Let $\hat{s}_i^{rgb}(k,l)$ be a node or a root and let $M_{k,l}$ be a $3 \times 3$ matrix that represents its 8-neighborhood. A pixel $\hat{s}_i^{rgb}(q,r) \in M_{k,l}$ is a child of $\hat{s}_i^{rgb}(k,l)$ if $\hat{s}_i^{rgb}(q,r)$ is a node (see Fig. 5).

The DFS is applied from each root in the graph. Each node that is scanned by the DFS represents a pixel that belongs to the foreground objects that we wish to find. The scanned pixels are marked as the new foreground pixels and all other pixels as the new background pixels.
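The root/node/child rules above amount to a seeded flood fill over the 8-connected RGB foreground. The sketch below is one possible reading, not the authors' implementation: it uses an explicit stack instead of recursion (to avoid Python's recursion limit), marks pixels when they are pushed, and assumes that a root is itself kept as foreground even if the RGB mask missed it. The name `dfs_combine` is hypothetical.

```python
import numpy as np


def dfs_combine(mask_g, mask_rgb):
    """Seeded DFS over the 8-connected graph of RGB foreground pixels.

    mask_g:   binary grayscale output (roots; it misses some foreground)
    mask_rgb: binary RGB output (nodes; it contains some false positives)
    Every pixel reached from a root through RGB foreground pixels becomes
    foreground in the result; everything else becomes background.
    """
    rows, cols = mask_g.shape
    out = np.zeros((rows, cols), dtype=np.uint8)
    for k, l in zip(*np.nonzero(mask_g)):
        if out[k, l]:  # already classified as foreground: not a new root
            continue
        out[k, l] = 1  # assumption: the root itself is kept as foreground
        stack = [(k, l)]
        while stack:
            i, j = stack.pop()
            for di in (-1, 0, 1):  # visit the 3 x 3 neighborhood M_{i,j}
                for dj in (-1, 0, 1):
                    q, r = i + di, j + dj
                    if (0 <= q < rows and 0 <= r < cols
                            and mask_rgb[q, r] and not out[q, r]):
                        out[q, r] = 1  # child: an unvisited RGB foreground node
                        stack.append((q, r))
    return out
```

An RGB detection that is not connected to any grayscale foreground pixel is discarded, which is how the false positives of $\hat{s}_i^{rgb}$ are suppressed while the gaps in $\tilde{s}_i^{g}$ are filled.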

[Figure 5 diagram: the $3 \times 3$ neighborhood $M_{k,l}$ centered at pixel $(k,l)$, with neighbors such as $(k+1, l-1)$.]

Figure 5: A graph representation of the neighboring matrix $M_{k,l}$ of the root pixel $\hat{s}_i^{rgb}(k,l)$. A root pixel is set to -1, a foreground pixel is set to 1 and a background pixel is set to 0.



4.3 A parallel extension of the SBSDB and the DBSDB algorithms 

We propose parallel extensions to the SBSDB and DBSDB algorithms. We describe the scheme for the SBSDB algorithm; the same scheme can be used for the DBSDB algorithm.

First, the data cube $D_n = \{s_{ij}^{t} : i, j = 1, \ldots, N,\ t = 1, \ldots, n\}$ is decomposed into overlapping blocks $\{\beta_{k,l}\}$. Next, the SBSDB algorithm is applied independently to each block. This step can run in parallel. The final result of the algorithm is constructed from the results of the individual blocks: the result of each block is placed at its original location in $D_n$. The result for pixels that lie in overlapping areas between adjacent blocks is obtained by applying a logical OR operation to the corresponding block results.
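The decomposition and merge can be sketched as below. This is a minimal illustration under assumptions: `segment_block` is a hypothetical stand-in for the per-block SBSDB (or DBSDB) run, the split rule (equal tiles, each grown by half the overlap on interior sides) is one possible choice, and a thread pool is used for simplicity; for CPU-bound pure-Python block work, `ProcessPoolExecutor` would usually be the better fit.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def block_bounds(n, parts, overlap):
    """Split range(n) into `parts` intervals; neighbours share `overlap` pixels."""
    edges = np.linspace(0, n, parts + 1).astype(int)
    lo, hi = overlap // 2, overlap - overlap // 2
    return [(int(max(0, edges[p] - lo)), int(min(n, edges[p + 1] + hi)))
            for p in range(parts)]


def parallel_segment(cube, segment_block, grid=(2, 2), overlap=20, workers=4):
    """Apply `segment_block` to overlapping spatial blocks of a (t, rows, cols)
    cube and merge the binary results, OR-ing them where blocks overlap."""
    _, rows, cols = cube.shape
    tiles = [(rb, cb)
             for rb in block_bounds(rows, grid[0], overlap)
             for cb in block_bounds(cols, grid[1], overlap)]

    def run(tile):
        (r0, r1), (c0, c1) = tile
        return tile, segment_block(cube[:, r0:r1, c0:c1])  # independent work

    out = np.zeros(cube.shape, dtype=bool)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for ((r0, r1), (c0, c1)), mask in pool.map(run, tiles):
            out[:, r0:r1, c0:c1] |= np.asarray(mask, dtype=bool)  # logical OR
    return out
```

With the settings used in section 5.1 (256 x 256 frames, a 2 x 2 grid, 20-pixel overlap), `block_bounds(256, 2, 20)` yields the intervals [0, 138) and [118, 256) along each axis under this split rule.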



5 Experimental results 

In this section, we present the results of applying the SBSDB and DBSDB algorithms. The section is divided into three parts. The first part presents the results of the SBSDB algorithm applied to an SBG video. The second part contains the results of applying the DBSDB algorithm to a DBG video. In the third part, we compare the results obtained by our algorithm with those obtained by five other background-subtraction algorithms.

Figure 6: The frames that $W_s$ contains. The test frame $s$ is the top-left frame. The frames are ordered from top-left to bottom-right.

5.1 Performance analysis of the SBSDB algorithm 

We apply the SBSDB algorithm to a video sequence that consists of 190 grayscale frames of size 256 x 256. The video sequence was captured by a static camera and is in AVI format with a frame rate of 15 fps. It shows moving cars over a static background. We apply the sequential version of the algorithm, where the size of the SW is set to 5. We also apply the parallel version of the algorithm, where the video sequence is divided into four blocks in a 2 x 2 formation. The overlap between two (horizontally or vertically) adjacent blocks is set to 20 pixels and the size of the SW is set to 10. Let $s$ be the test frame and let $W_s$ be the SW starting at $s$. Fig. 6 shows the frames that $W_s$ contains. The output of the SBSDB algorithm for $s$ is shown in Fig. 7.

Figure 7: (a) The background for the test frame $s$. (b) The test frame $s$ after the subtraction of the background. (c) The output for the test frame $s$. (d) The output for the test frame $s$ from the parallel version of the algorithm.

5.2 Performance analysis of the DBSDB algorithm

We apply the DBSDB algorithm to five video sequences. The first four video sequences are in AVI format with a frame rate of 30 fps. The last video sequence is in AVI format with a frame rate of 24 fps. All the video sequences are in RGB format. All except the first are of size 320 x 240; the first is of size 210 x 240. The video sequences were produced by a static camera and contain a dynamic background.
The input video sequences are: 

1. People walking in front of a fountain. The sequence contains moving objects in the background, such as flowing water, waving trees and a video screen whose content changes over time. The DBSDB input is an RTD that contains 170 frames and a BGD that contains 100 frames. The output of the DBSDB algorithm is presented in Fig. 8(g).

2. A person walking in front of bushes with waving leaves. The DBSDB input is an RTD that contains 88 frames and a BGD that contains 160 frames. The output of the DBSDB algorithm is presented in Fig. 9(g).

3. A moving ball in front of waving trees. The DBSDB input is an RTD that contains 88 frames and a BGD that contains 160 frames. The output of the DBSDB algorithm is presented in Fig. 10. Figure 10(d) contains the result of the sequential version of the algorithm and Fig. 10(g) contains the result of the parallel version. For the parallel version, the video sequence was divided into four blocks in a 2 x 2 formation. The overlap between two (horizontally or vertically) adjacent blocks was set to 20 pixels and the size of the SW was set to 30.

4. A ball jumping in front of trees and a car passing behind the trees. The DBSDB input is an RTD that contains 106 frames and a BGD that contains 160 frames. The output of the DBSDB algorithm is presented in Fig. 10(e).

5. A person walking in front of a sprinkler. The DBSDB input is an RTD that contains 121 frames and a BGD that contains 100 frames. The output of the DBSDB algorithm is presented in Fig. 10(f).

Figure 8: (a), (d) The original test frames in grayscale and RGB, respectively. (b), (e) The grayscale and RGB test frames after the background subtraction in the classification phase of the DBSDB algorithm, respectively. (c), (f) Results after the thresholding of (b) and (e), respectively. (g) The final output of the DBSDB algorithm after the application of the DFS.



Figure 9: (a), (d) The original test frames in grayscale and RGB, respectively. (b), (e) The grayscale and RGB test frames after the background subtraction in the classification phase of the DBSDB algorithm, respectively. (c), (f) Results after the thresholding of (b) and (e), respectively. (g) The final output of the DBSDB algorithm after the application of the DFS.

5.3 Performance comparison between the BSDB algorithm and other algorithms

We compared the BSDB algorithm with five other background subtraction algorithms. The input data and the results are taken from [20]. All the test sequences were captured by a camera that has three CCD arrays. The frames are of size 160 x 120 in RGB format and are sampled at 4 Hz. In video sequences where the background changes, the test frame that was segmented is the frame that appears 50 frames after the background change. On every output frame (except the output of the BSDB algorithm), speckle removal [20] was applied to eliminate islands of 4-connected foreground pixels that contain fewer than 8 pixels. All other parameters were adjusted for each algorithm in order to obtain visually optimal results over the entire dataset, and the same parameters were used for all sequences. Each test sequence, except for the bootstrapping sequence, begins with at least 200 background frames that were used for training the algorithms. Objects such as cars, which might be considered foreground in some applications, were deliberately excluded from the sequences.
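The speckle-removal post-processing applied to the competing algorithms can be illustrated with a small connected-component pass. This sketch is an illustration of the rule as stated (drop 4-connected foreground islands smaller than 8 pixels), not the code used in [20]; the name `remove_speckles` is hypothetical and only NumPy and the standard library are used.

```python
from collections import deque

import numpy as np


def remove_speckles(mask, min_size=8):
    """Delete 4-connected foreground islands with fewer than `min_size` pixels."""
    mask = np.asarray(mask, dtype=bool)
    rows, cols = mask.shape
    seen = np.zeros_like(mask)
    out = mask.copy()
    for si in range(rows):
        for sj in range(cols):
            if not mask[si, sj] or seen[si, sj]:
                continue
            component, queue = [], deque([(si, sj)])  # BFS over one island
            seen[si, sj] = True
            while queue:
                i, j = queue.popleft()
                component.append((i, j))
                for q, r in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if (0 <= q < rows and 0 <= r < cols
                            and mask[q, r] and not seen[q, r]):
                        seen[q, r] = True
                        queue.append((q, r))
            if len(component) < min_size:  # island too small: erase it
                for i, j in component:
                    out[i, j] = False
    return out
```

Under 4-connectivity, two diagonally touching pixels form two separate one-pixel islands, so both are removed.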

Each of the sequences poses a different problem in background maintenance. The chosen 
sequences and their corresponding problems are: 

Background object is moved - Problem: A background object can move. Such objects should not be considered part of the foreground. The sequence contains a person who walks into a conference room, makes a telephone call, and leaves with the phone and a chair in a different position. The test frame is the one that appears 50 frames after the person has left the scene.

Bootstrapping - Problem: A training period without foreground objects is not available. The sequence contains an overhead view of a cafeteria. There is constant motion and every frame contains people.

Waving Trees - Problem: Backgrounds can contain moving objects. The sequence contains a person walking in front of a swaying tree.

Camouflage - Problem: Pixels of foreground objects may be falsely recognized as background pixels. The sequence contains a monitor on a desk with rolling interference bars. A person walks into the scene and stands in front of the monitor.

Figure 10: (a)-(c) The original test frames. (d)-(g) The segmented outputs from the application of the DBSDB algorithm. (a) A ball in front of waving trees; (d), (g) the results of the sequential and parallel versions of the algorithm applied to (a), respectively. (b), (e) A ball jumping in front of a tree and a car passing behind the trees. (c), (f) A person walking in front of a sprinkler.

We apply six background subtraction algorithms to these sequences, including the algorithm that is presented in this paper. The background subtraction algorithms are:

Adjacent Frame Difference - Each frame is subtracted from the previous frame in the sequence. Absolute differences greater than a threshold are marked as foreground.

Mean and Threshold - Pixel-wise mean values are computed during a training phase, and pixels within a fixed threshold of the mean are considered background.

Mean and Covariance - The mean and covariance of pixel values are updated continuously lf2TTl. Foreground pixels are determined by applying a threshold to the Mahalanobis distance.

Mixture of Gaussians - This algorithm is reviewed in section 2.

Eigen-background - This algorithm is reviewed in section 2.

BSDB - The algorithm presented in this paper (section 4).
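The two simplest baselines in the list above can be written directly from their one-line descriptions. This is a minimal grayscale sketch for illustration only; the function names and the threshold values in any usage are hypothetical and are not the parameters used in [20].

```python
import numpy as np


def adjacent_frame_difference(frames, threshold):
    """frames: (T, H, W) grayscale stack. Returns (T - 1, H, W) masks where
    |f_t - f_{t-1}| > threshold marks foreground."""
    f = frames.astype(np.float64)
    return np.abs(f[1:] - f[:-1]) > threshold


def mean_and_threshold(training, frame, threshold):
    """Background model = pixel-wise mean of the training frames; pixels
    within `threshold` of the mean are background, the rest foreground."""
    mean = training.astype(np.float64).mean(axis=0)
    return np.abs(frame.astype(np.float64) - mean) > threshold
```

Both baselines compare each pixel with a single reference value, which is why they react to waving trees or rolling monitor bars as foreground.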

The outputs of these algorithms are shown in Fig. 11.

We applied the SBSDB algorithm to the first two video sequences: the moved chair and bootstrapping. In both cases, the background in the video sequence is static. In the first video sequence, the SBSDB algorithm handles the change in the position of the chair, which is part of the background. Since the SBSDB algorithm does not require a training process, it can handle the second video sequence, where no clean background is available for training. Algorithms that require a training process cannot handle this case.

We applied the DBSDB algorithm to the waving trees and camouflage video sequences. In both cases, the background in the video sequence is dynamic. In the first video sequence, the DBSDB algorithm captures the movement of the waving trees and eliminates it from the video sequence, whereas the other algorithms produce false positive detections. The DBSDB algorithm does not handle the last video sequence well, because the foreground object covers the moving background object (the monitor). In this case, the number of false negative detections is significant.
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Figure 11: The outputs from the applications of the BSDB and five other algorithms. Each row 
shows the results of one algorithm, and each column represents one problem in background main- 
tenance. The top row shows the test frames. The second row shows the optimal background 
outputs. 
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6 Conclusion and Future Work 



In this work, we introduced the BSDB algorithm for automatic segmentation of video sequences. The algorithm has two versions: the SBSDB algorithm for video sequences with a static background and the DBSDB algorithm for video sequences that contain a dynamic background. The BSDB algorithm captures the background by reducing the dimensionality of the input via the DB algorithm. The SBSDB algorithm uses an on-line procedure, while the DBSDB algorithm uses an off-line (training) procedure and an on-line procedure. During the training phase, the DBSDB algorithm captures the dynamic background by iteratively applying the DB algorithm to the background training data. The BSDB algorithm produces a high-quality segmentation of the input video sequences. Moreover, the experiments showed that the BSDB algorithm outperforms the compared state-of-the-art algorithms in several difficult background-maintenance situations.

The performance of the BSDB algorithm can be enhanced by improving the accuracy of the threshold values. Furthermore, a method for the automatic computation of $\mu$, which is used in the threshold computation (sections 4.1.4 and 4.2.2), still needs to be developed.

Additionally, the output of the BSDB algorithm contains a fair number of false negative detections when a foreground object obscures a brighter background object. This will be addressed in future versions of the algorithm.

The BSDB algorithm can be useful for achieving low-bit-rate video compression for the transmission of rich multimedia content: the captured background is transmitted once, followed by the detected segmented objects.